🐋DeepSeek-R1-0528: How to Run Locally
A guide on how to run DeepSeek-R1-0528 including Qwen3 on your own local device!
DeepSeek-R1-0528 is DeepSeek's new update to their R1 reasoning model. The full 671B-parameter model requires 715GB of disk space, while the quantized dynamic 1.66-bit version uses only 162GB (an 80% reduction in size). GGUF: DeepSeek-R1-0528-GGUF
New TQ1_0 dynamic 1.66-bit quant which is 162GB in size. Perfect for 192GB RAM (or unified memory) and those who use Ollama setups.
Try: `ollama run hf.co/unsloth/DeepSeek-R1-0528-GGUF:TQ1_0`
DeepSeek also released an R1-0528 distilled version by fine-tuning Qwen3 (8B). The distill achieves similar performance to Qwen3 (235B). You can also fine-tune it with Unsloth. Qwen3 GGUF: DeepSeek-R1-0528-Qwen3-8B-GGUF
These uploads use our Unsloth Dynamic 2.0 methodology and calibration dataset, delivering the best performance on 5-shot MMLU and KL Divergence benchmarks. This means you can run and fine-tune quantized DeepSeek LLMs with minimal accuracy loss!
For DeepSeek-R1-0528-Qwen3-8B, the model can fit on pretty much any setup, even those with as little as 20GB of RAM, and needs no prep beforehand. However, the full R1-0528 model, which is 715GB in size, will need extra prep. The 1.78-bit (IQ1_S) quant will fit in a single 24GB GPU (with all layers offloaded). Expect around 5 tokens/s with this setup if you also have 128GB of RAM.
It is recommended to have at least 64GB of RAM to run this quant (expect about 1 token/s without a GPU). For optimal performance, you will need at least 180GB of unified memory, or 180GB of combined RAM+VRAM, for 5+ tokens/s.
We suggest using our 2.7-bit (Q2_K_XL) or 2.4-bit (IQ2_XXS) quants to balance size and accuracy.
Though not necessary, for the best performance, have your combined VRAM + RAM be at least as large as the quant you're downloading.
According to DeepSeek, these are the recommended settings for R1 inference (R1-0528 and the Qwen3 distill should use the same settings); an example command follows this list:
Set the temperature to 0.6 to reduce repetition and incoherence.
Set top_p to 0.95 (recommended)
Run multiple tests and average results for reliable evaluation.
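For example, with llama.cpp's `llama-cli` these settings map directly onto sampling flags. This is a minimal sketch; the repo:quant tag is only an illustration, so swap in whichever quant you actually run.

```bash
# DeepSeek's recommended sampling settings passed as llama-cli flags.
# The repo:quant tag below is just an illustration.
./llama.cpp/llama-cli \
    -hf unsloth/DeepSeek-R1-0528-GGUF:IQ1_S \
    --temp 0.6 \
    --top-p 0.95
```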
R1-0528 uses the same chat template as the original R1 model. You do not need to force `<think>\n`, but you can still add it in!
`<|begin▁of▁sentence|><|User|>What is 1+1?<|Assistant|>It's 2.<|end▁of▁sentence|><|User|>Explain more!<|Assistant|>`
A BOS token is forcibly added, and an EOS token separates each interaction. To avoid double BOS tokens during inference, only call `tokenizer.encode(..., add_special_tokens = False)`, since the chat template already adds a BOS token.
For llama.cpp / GGUF inference, you should skip the BOS since llama.cpp adds it automatically:
`<|User|>What is 1+1?<|Assistant|>`
The `<think>` and `</think>` markers get their own designated tokens.
ALL our uploads, including those that are not imatrix-based or dynamic, utilize our calibration dataset, which is specifically optimized for conversational, coding, and language tasks.
Qwen3 (8B) distill: DeepSeek-R1-0528-Qwen3-8B-GGUF
Full DeepSeek-R1-0528 model uploads below:
We also uploaded IQ4_NL and Q4_1 quants, which run faster specifically on ARM and Apple devices respectively.
| MoE bits | Disk size | Details |
| --- | --- | --- |
| 1.66bit | 162GB | 1.92/1.56bit |
| 1.78bit | 185GB | 2.06/1.56bit |
| 1.93bit | 200GB | 2.5/2.06/1.56bit |
| 2.42bit | 216GB | 2.5/2.06bit |
| 2.71bit | 251GB | 3.5/2.5bit |
| 3.12bit | 273GB | 3.5/2.06bit |
| 3.5bit | 346GB | 4.5/3.5bit |
| 4.5bit | 384GB | 5.5/4.5bit |
| 5.5bit | 481GB | 6.5/5.5bit |
We've also uploaded versions in BF16 format and in the original FP8 (float8) format.
Run the model! Note that you can call `ollama serve` in another terminal if it fails. We include all our fixes and suggested parameters (temperature, etc.) in `params` in our Hugging Face upload!
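For example, a minimal sketch for the Qwen3-8B distill. The `Q4_K_XL` quant tag here is an assumption; substitute any quant available in the repo.

```bash
# Pull and run the R1-0528 Qwen3-8B distill straight from Hugging Face via Ollama.
# The :Q4_K_XL tag is an assumption -- use whichever quant the repo actually provides.
ollama run hf.co/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF:Q4_K_XL
```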
Open WebUI has made a step-by-step tutorial on how to run R1 here; for R1-0528, you just need to replace R1 with the new 0528 quant: docs.openwebui.com/tutorials/integrations/deepseekr1-dynamic/
(NEW) To run the full R1-0528 model in Ollama, you can use our TQ1_0 (162GB quant):
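This is the same command shown at the top of this guide:

```bash
# Runs the full R1-0528 model using the 162GB TQ1_0 dynamic quant
ollama run hf.co/unsloth/DeepSeek-R1-0528-GGUF:TQ1_0
```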
If you want to use Ollama for inference on these GGUFs, you first need to merge the 3 GGUF split files into 1, as in the example below. Then you can run the model locally.
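A minimal sketch of the merge step, assuming llama.cpp's `llama-gguf-split` tool has been built (see the build step below) and using placeholder shard names:

```bash
# Merge split GGUF shards into a single file.
# Pass the FIRST shard; llama-gguf-split locates the remaining shards automatically.
# Filenames are placeholders -- match them to the shards you downloaded.
./llama.cpp/llama-gguf-split --merge \
    DeepSeek-R1-0528-UD-IQ1_S-00001-of-00003.gguf \
    DeepSeek-R1-0528-UD-IQ1_S-merged.gguf
```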
Then use llama.cpp directly to download the model:
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you don't have a GPU or just want CPU inference.
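A build sketch based on llama.cpp's standard CMake instructions (the apt package list and build targets are assumptions; adjust for your system):

```bash
# Install build dependencies (Debian/Ubuntu; may need sudo)
apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
# Use -DGGML_CUDA=OFF for a CPU-only build
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp/
```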
If you want to use llama.cpp directly to load models, you can do the below (`:IQ1_S` is the quantization type). You can also download via Hugging Face (point 3). This is similar to `ollama run`. Use `export LLAMA_CACHE="folder"` to force llama.cpp to save downloads to a specific location.
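A minimal run sketch, assuming the binaries built above; the cache folder name and the GPU-layer count are placeholders:

```bash
# Cache downloaded GGUF files in a specific folder (placeholder path)
export LLAMA_CACHE="DeepSeek-R1-0528-GGUF"
# Download and run the 1.78-bit IQ1_S quant directly from Hugging Face
./llama.cpp/llama-cli \
    -hf unsloth/DeepSeek-R1-0528-GGUF:IQ1_S \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    -ot ".ffn_.*_exps.=CPU" \
    --temp 0.6 \
    --top-p 0.95
```

The `-ot` flag used here is explained below; drop `--n-gpu-layers` for CPU-only inference.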
Please try out `-ot ".ffn_.*_exps.=CPU"` to offload all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on 1 GPU, improving generation speeds. You can customize the regex expression to fit more layers if you have more GPU capacity.

- If you have a bit more GPU memory, try `-ot ".ffn_(up|down)_exps.=CPU"`, which offloads the up and down projection MoE layers.
- Try `-ot ".ffn_(up)_exps.=CPU"` if you have even more GPU memory. This offloads only the up projection MoE layers.
- And finally offload all layers via `-ot ".ffn_.*_exps.=CPU"`. This uses the least VRAM.
- You can also customize the regex, for example `-ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU"`, which offloads the gate, up and down MoE layers, but only from the 6th layer onwards.
Download the model using the snippet below (after running `pip install huggingface_hub hf_transfer`). You can choose `UD-IQ1_S` (the dynamic 1.78-bit quant) or other quantized versions like `Q4_K_M`. We recommend using our 2.7-bit dynamic quant `UD-Q2_K_XL` to balance size and accuracy. More versions at: https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF
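A download sketch using the `huggingface-cli` tool that ships with `huggingface_hub` (the include pattern and local directory are assumptions; change them to match the quant you want):

```bash
pip install huggingface_hub hf_transfer
# Optional: enable the faster hf_transfer download backend
export HF_HUB_ENABLE_HF_TRANSFER=1
# Fetch only the 2.7-bit UD-Q2_K_XL shards (pattern assumes the repo's folder layout)
huggingface-cli download unsloth/DeepSeek-R1-0528-GGUF \
    --include "*UD-Q2_K_XL*" \
    --local-dir DeepSeek-R1-0528-GGUF
```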
Run Unsloth's Flappy Bird test as described in our 1.58bit Dynamic Quant for DeepSeek R1.
Edit `--threads 32` for the number of CPU threads, `--ctx-size 16384` for context length, and `--n-gpu-layers 2` for how many layers to offload to the GPU. Try adjusting it if your GPU goes out of memory. Also remove it if you are running CPU-only inference.
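Putting it together, a run sketch against a locally downloaded quant (the shard path and the prompt are placeholders; point `--model` at the first split file and llama.cpp will load the remaining shards):

```bash
./llama.cpp/llama-cli \
    --model DeepSeek-R1-0528-GGUF/UD-Q2_K_XL/DeepSeek-R1-0528-UD-Q2_K_XL-00001-of-00006.gguf \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 2 \
    -ot ".ffn_.*_exps.=CPU" \
    --temp 0.6 \
    --top-p 0.95 \
    --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"
```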
You can also test our dynamic quants via the r/LocalLLaMA test, which has the model create a basic physics engine to simulate balls rotating in a moving enclosed heptagon shape.
To fine-tune DeepSeek-R1-0528-Qwen3-8B with Unsloth, simply swap out the original model in our Qwen3 fine-tuning notebook with the R1 distilled version.
We are still in the process of making a reasoning + conversational notebook and a GRPO notebook so stay tuned!
Unsloth makes Qwen3 distill fine-tuning 2× faster, uses 70% less VRAM, and supports 8× longer context lengths.